
[storage] Atomic alter table in the storage controller #31408

Merged: 1 commit, Feb 18, 2025

Conversation

@bkirwi (Contributor) commented Feb 7, 2025

This avoids a possible race in the storage controller, where a downgrade of our read capability can race the selection of new capabilities for a new version of the table. (More detail in a comment downthread.)

We instead switch to:

  • using the frontiers of the old version directly;
  • doing the frontier selection and the state update atomically while holding the lock;

...which together should avoid this particular race.
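
For illustration only, here is a minimal sketch of the shape of the change, using hypothetical names and plain integers in place of the storage controller's actual types: the new version's since is taken directly from the old version's in-memory read capabilities, and the new collection is installed before the lock is released.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-ins for the storage controller's collection state.
type Timestamp = u64;
type GlobalId = u64;

#[derive(Clone, Debug)]
struct CollectionState {
    /// The frontier of this collection's read capabilities.
    read_capability: Timestamp,
}

#[derive(Default)]
struct Collections {
    collections: BTreeMap<GlobalId, CollectionState>,
}

/// Sketch of the atomic alter-table path: the new version's since comes
/// straight from the old version's in-memory capabilities, and the new
/// collection is registered while the same lock is still held, so no
/// concurrent downgrade can slip in between frontier selection and the
/// state update.
fn alter_table_atomic(state: &Arc<Mutex<Collections>>, old_id: GlobalId, new_id: GlobalId) {
    let mut state = state.lock().expect("lock poisoned");
    let since = state
        .collections
        .get(&old_id)
        .expect("old collection exists")
        .read_capability;
    // Install the new version with the frontier we just read, before
    // releasing the lock, so both steps are observed atomically.
    state
        .collections
        .insert(new_id, CollectionState { read_capability: since });
}

fn main() {
    let state = Arc::new(Mutex::new(Collections::default()));
    state
        .lock()
        .unwrap()
        .collections
        .insert(1, CollectionState { read_capability: 10 });
    alter_table_atomic(&state, 1, 2);
    println!("{:?}", state.lock().unwrap().collections.get(&2));
}
```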

Motivation

https://github.com/MaterializeInc/database-issues/issues/8952

Tips for reviewer

This bug is quite tricky to reproduce, so I'm not 100% confident that this is the issue we're seeing in CI. Happy to take suggestions for improving the tests! But also please check my logic below: that it sounds like an accurate description of the current behaviour, and that the particular interleaving of events it imagines would explain the symptoms we see.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@bkirwi force-pushed the alter-table branch 2 times, most recently from a4ee72f to 016a188, on February 14, 2025 00:10
@bkirwi (Contributor, Author) commented Feb 14, 2025

A hypothesis

At the storage collections level, alter table is roughly:

  1. Grab a snapshot of the frontiers of the previous version, as known to the storage controller.
  2. Evolve the schema of the shard.
  3. Obtain new handles for the shard. Notably, these frontiers are not advanced: they’re based on whatever’s currently durable in Persist.
  4. Generate new metadata for the collection. The initial since hold of this collection is:
    a. the current critical since of the shard (read from Persist);
    b. the overall frontier of the read capabilities obtained in step 1.
  5. Atomically modify the controller state: add the new version and point the old version to it. Among other things, this means that only changes to the new version’s frontiers get sent to the background worker and synced back to Persist.

Observations:

  • The frontier we get in 3 will generally be behind what we got in 1, since we write sinces back to Persist asynchronously. Even if we ignore the "maybe" in maybe_downgrade_since, there's an interval between when the frontier changes in our in-memory state and when it gets dequeued by the background worker and written to consensus state.
  • However, it’s also possible for the frontier we get in 3 to be ahead, since this whole setup is not atomic, and we might have downgraded the critical handle in the meantime.

So: suppose we ALTER TABLE right as we downgrade the since of our table from t=10 to t=20, enqueueing but not applying a downgrade. In that case, we might see t=20 in step 1 but t=10 in step 3. Then, as we go to write the new metadata (with an overall since of t=10) into our storage collections state, the background worker might finally apply the downgrade to t=20, making t=10 permanently unreadable even though the storage controller still thinks it can issue handles at that frontier for the new version of the table.
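
To make the interleaving concrete, here's a toy model of the sequence above (hypothetical names and plain integers for timestamps; no Persist involved), showing how the overall since chosen for the new version can end up below the shard's actual critical since:

```rust
use std::collections::VecDeque;

fn main() {
    // Step 1: the controller's in-memory read capability has already moved to t=20...
    let in_memory_capability: u64 = 20;
    // ...but the durable critical since still says t=10, because the downgrade
    // to t=20 is only enqueued for the background worker, not yet applied.
    let mut durable_critical_since: u64 = 10;
    let mut pending_downgrades: VecDeque<u64> = VecDeque::from(vec![20]);

    // Steps 3/4: the alter path reads the durable since, which lags behind,
    // and that becomes the overall since of the new metadata.
    let new_version_since = durable_critical_since; // t=10

    // Now the background worker drains its queue and applies the downgrade.
    while let Some(t) = pending_downgrades.pop_front() {
        durable_critical_since = durable_critical_since.max(t); // now t=20
    }

    // Step 5 has installed a collection that claims readability at t=10,
    // but the shard is only readable from t=20.
    assert!(new_version_since < durable_critical_since);
    println!(
        "controller since = {}, shard readable from = {} (in-memory capability was {})",
        new_version_since, durable_critical_since, in_memory_capability
    );
}
```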

@bkirwi bkirwi changed the title [wip] Debug alter table [storage] Atomic alter table in the storage controller Feb 14, 2025
@bkirwi bkirwi marked this pull request as ready for review February 14, 2025 18:25
@bkirwi bkirwi requested a review from a team as a code owner February 14, 2025 18:25
@bkirwi bkirwi requested a review from aljoscha February 14, 2025 21:12
@aljoscha (Contributor) left a comment


Yeah, I think this can happen! At first I was skeptical because I thought re-acquiring the since handle would protect us, because I mistakenly thought that someone somewhere would have to fail because of a mismatched since when trying to downgrade. But, we only compare against the epoch: when downgrading a critical since handle we don't check that the actual since frontier is what we expect. So in the alter table situation it can (does) happen that there are two valid critical since handles in the process that can downgrade the since (the one we freshly acquired, and the one in the background worker), and they both map to the one logical since handle (the one since handle ID).
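
As a toy illustration of why the epoch check alone doesn't fence this (a hypothetical API, not Persist's real one): a compare-and-downgrade that validates only an opaque epoch will accept a downgrade from either of two live handles sharing the same handle ID, even if one of them is working from a stale view of the frontier.

```rust
/// A toy critical-since shard: downgrades are fenced only by an epoch
/// (an opaque token), not by the frontier the caller expects to see.
struct CriticalSince {
    epoch: u64,
    since: u64,
}

impl CriticalSince {
    /// Hypothetical compare-and-downgrade: succeeds whenever the caller's
    /// epoch matches, regardless of what the caller believes the since is.
    fn compare_and_downgrade(&mut self, epoch: u64, new_since: u64) -> Result<(), u64> {
        if epoch != self.epoch {
            return Err(self.epoch);
        }
        self.since = self.since.max(new_since);
        Ok(())
    }
}

fn main() {
    let mut shard = CriticalSince { epoch: 7, since: 10 };

    // Two live handles in the same process, same handle ID, same epoch:
    // the freshly acquired one (alter table) and the background worker's.
    let fresh_handle_epoch = 7;
    let background_handle_epoch = 7;

    // The alter path plans around since = 10...
    let planned_since = shard.since;

    // ...but both handles' downgrades are accepted, because only the epoch is checked.
    shard.compare_and_downgrade(background_handle_epoch, 20).unwrap();
    shard.compare_and_downgrade(fresh_handle_epoch, 15).unwrap(); // accepted, but a no-op

    // The planned since is now below what the shard will actually serve.
    assert!(planned_since < shard.since);
    println!("planned since = {}, actual since = {}", planned_since, shard.since);
}
```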

Very nice find! 🙌

@bkirwi (Contributor, Author) commented Feb 17, 2025

Great! So far the only nightly errors are either also present on main or look like flakes (they don't alter any tables, so this code wouldn't run), but I'll try to confirm that before merging.

@bkirwi bkirwi merged commit 2519925 into MaterializeInc:main Feb 18, 2025
170 of 172 checks passed
@bkirwi bkirwi deleted the alter-table branch February 18, 2025 15:46